Logistic Regression

Roadmap
(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?

Lecture 9: Linear Regression
analytic solution $\mathbf{w}_{\text{LIN}} = X^{\dagger}\mathbf{y}$ with linear regression hypotheses and squared error

Lecture 10: Logistic Regression
• Logistic Regression Problem
• Logistic Regression Error
• Gradient of Logistic Regression Error
• Gradient Descent

(4) How Can Machines Learn Better?
Logistic Regression Problem

Heart Attack Prediction Problem (1/2)

age 40 years
gender male
blood pressure 130/85
cholesterol level 240
weight 70
heart disease? yes

[Figure: learning flow. Unknown target distribution $P(y|\mathbf{x})$ containing $f(\mathbf{x})$ + noise; training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$; hypothesis set $\mathcal{H}$; error measure err; final hypothesis $g \approx f$.]

binary classification:
ideal $f(\mathbf{x}) = \mathrm{sign}\left(P(+1|\mathbf{x}) - \tfrac{1}{2}\right) \in \{-1, +1\}$
because of classification err
Heart Attack Prediction Problem (2/2)

age 40 years
gender male
blood pressure 130/85
cholesterol level 240
weight 70

[Figure: learning flow. Unknown target distribution $P(y|\mathbf{x})$; training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$; hypothesis set $\mathcal{H}$; error measure err; final hypothesis $g \approx f$.]

'soft' binary classification:
$f(\mathbf{x}) = P(+1|\mathbf{x}) \in [0, 1]$





Soft Binary Classification

target function $f(\mathbf{x}) = P(+1|\mathbf{x}) \in [0, 1]$

ideal (noiseless) data
$(\mathbf{x}_1, y'_1 = 0.9 = P(+1|\mathbf{x}_1))$
$(\mathbf{x}_2, y'_2 = 0.2 = P(+1|\mathbf{x}_2))$
$\vdots$
$(\mathbf{x}_N, y'_N = 0.6 = P(+1|\mathbf{x}_N))$

actual (noisy) data
$(\mathbf{x}_1, y_1 = \circ \sim P(y|\mathbf{x}_1))$
$(\mathbf{x}_2, y_2 = \times \sim P(y|\mathbf{x}_2))$
$\vdots$
$(\mathbf{x}_N, y_N = \times \sim P(y|\mathbf{x}_N))$

same data as hard binary classification,
different target function


Logistic Hypothesis

age 40 years
gender male
blood pressure 130/85
cholesterol level 240

• For $\mathbf{x} = (x_0, x_1, x_2, \ldots, x_d)$ 'features of patient', calculate a weighted 'risk score':
  $s = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T\mathbf{x}$
• convert the score to estimated probability by logistic function $\theta(s)$

logistic hypothesis: $h(\mathbf{x}) = \theta(\mathbf{w}^T\mathbf{x})$
Logistic Function

$\theta(-\infty) = 0$; $\theta(0) = \tfrac{1}{2}$; $\theta(\infty) = 1$

$\theta(s) = \frac{e^{s}}{1 + e^{s}} = \frac{1}{1 + e^{-s}}$
(smooth, monotonic, sigmoid function of $s$)

logistic regression: use
$h(\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
to approximate target function $f(\mathbf{x}) = P(+1|\mathbf{x})$
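The logistic function and the logistic hypothesis above can be sketched in a few lines of Python. This is only an illustration: the weights and feature values are made up, and the convention $x_0 = 1$ for the constant feature is an assumption not stated on the slide.

```python
import math

def theta(s):
    """Logistic function theta(s) = 1 / (1 + exp(-s)), written to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)              # s < 0 here, so exp(s) cannot overflow
    return e / (1.0 + e)

def h(w, x):
    """Logistic hypothesis h(x) = theta(w^T x)."""
    score = sum(wi * xi for wi, xi in zip(w, x))   # weighted 'risk score'
    return theta(score)

# Hypothetical weights and features (x_0 = 1 is an assumed constant feature).
w = [-1.0, 0.5, 0.25]
x = [1.0, 2.0, 4.0]
risk = h(w, x)                   # score = -1 + 1 + 1 = 1, so risk = theta(1)
```

The two-branch form of `theta` is a design choice for numerical safety; mathematically both branches are the same function.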





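Fun Time

Logistic Regression and Binary Classification

Consider any logistic hypothesis $h(\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$ that approximates $P(y|\mathbf{x})$. 'Convert' $h(\mathbf{x})$ to a binary classification prediction by taking $\mathrm{sign}\left(h(\mathbf{x}) - \tfrac{1}{2}\right)$. What is the equivalent formula for the binary classification prediction?

(1) $\mathrm{sign}\left(\mathbf{w}^T\mathbf{x} - \tfrac{1}{2}\right)$
(2) $\mathrm{sign}\left(\mathbf{w}^T\mathbf{x}\right)$
(3) $\mathrm{sign}\left(\mathbf{w}^T\mathbf{x} + \tfrac{1}{2}\right)$
(4) none of the above

Reference Answer: (2)

When $\mathbf{w}^T\mathbf{x} = 0$, $h(\mathbf{x})$ is exactly $\tfrac{1}{2}$. So thresholding $h(\mathbf{x})$ at $\tfrac{1}{2}$ is the same as thresholding $\mathbf{w}^T\mathbf{x}$ at 0.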
Logistic Regression Error

Three Linear Models

linear scoring function: $s = \mathbf{w}^T\mathbf{x}$

linear classification
  $h(\mathbf{x}) = \mathrm{sign}(s)$
  plausible err = 0/1 (small flipping noise)

linear regression
  $h(\mathbf{x}) = s$
  friendly err = squared (easy to minimize)

logistic regression
  $h(\mathbf{x}) = \theta(s)$

how to define $E_{\text{in}}(\mathbf{w})$ for logistic regression?
Likelihood

target function $f(\mathbf{x}) = P(+1|\mathbf{x})$
$\Leftrightarrow P(y|\mathbf{x}) = \begin{cases} f(\mathbf{x}) & \text{for } y = +1 \\ 1 - f(\mathbf{x}) & \text{for } y = -1 \end{cases}$

consider $\mathcal{D} = \{(\mathbf{x}_1, \circ), (\mathbf{x}_2, \times), \ldots, (\mathbf{x}_N, \times)\}$

probability that $f$ generates $\mathcal{D}$
$P(\mathbf{x}_1)P(\circ|\mathbf{x}_1) \times P(\mathbf{x}_2)P(\times|\mathbf{x}_2) \times \ldots \times P(\mathbf{x}_N)P(\times|\mathbf{x}_N)$
$= P(\mathbf{x}_1)f(\mathbf{x}_1) \times P(\mathbf{x}_2)(1 - f(\mathbf{x}_2)) \times \ldots \times P(\mathbf{x}_N)(1 - f(\mathbf{x}_N))$

likelihood that $h$ generates $\mathcal{D}$
$P(\mathbf{x}_1)h(\mathbf{x}_1) \times P(\mathbf{x}_2)(1 - h(\mathbf{x}_2)) \times \ldots \times P(\mathbf{x}_N)(1 - h(\mathbf{x}_N))$

• if $h \approx f$, then likelihood($h$) $\approx$ probability using $f$
• probability using $f$ usually large





Likelihood of Logistic Hypothesis

likelihood($h$) $\approx$ (probability using $f$) $\approx$ large

$g = \operatorname*{argmax}_{h} \text{likelihood}(h)$

when logistic: $h(\mathbf{x}) = \theta(\mathbf{w}^T\mathbf{x})$, so $1 - h(\mathbf{x}) = h(-\mathbf{x})$

$\text{likelihood}(h) = P(\mathbf{x}_1)h(\mathbf{x}_1) \times P(\mathbf{x}_2)(1 - h(\mathbf{x}_2)) \times \ldots \times P(\mathbf{x}_N)(1 - h(\mathbf{x}_N))$
$= P(\mathbf{x}_1)h(+\mathbf{x}_1) \times P(\mathbf{x}_2)h(-\mathbf{x}_2) \times \ldots \times P(\mathbf{x}_N)h(-\mathbf{x}_N)$

$\text{likelihood}(\text{logistic } h) \propto \prod_{n=1}^{N} h(y_n\mathbf{x}_n)$
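A small numerical check may make the compact product form more concrete. The sketch below uses made-up data and weights, and drops the $P(\mathbf{x}_n)$ factors because they do not depend on $h$; it compares the case-by-case likelihood with the single product of $\theta(y_n\mathbf{w}^T\mathbf{x}_n)$.

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical data: label +1 plays the role of a circle, -1 of a cross.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [+1, -1, -1]
w = [0.3, -0.7]

# Form 1: h(x_n) when y_n = +1, and 1 - h(x_n) when y_n = -1.
case_by_case = 1.0
for xn, yn in zip(X, y):
    hx = theta(dot(w, xn))
    case_by_case *= hx if yn == +1 else 1.0 - hx

# Form 2: one product of theta(y_n w^T x_n), using 1 - theta(s) = theta(-s).
compact = 1.0
for xn, yn in zip(X, y):
    compact *= theta(yn * dot(w, xn))
```

Both variables should agree up to floating-point rounding, which is the identity $1 - h(\mathbf{x}) = h(-\mathbf{x})$ at work.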
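Cross-Entropy Error

$\max_{h} \text{likelihood}(\text{logistic } h) \propto \prod_{n=1}^{N} h(y_n\mathbf{x}_n)$

$\max_{\mathbf{w}} \text{likelihood}(\mathbf{w}) \propto \prod_{n=1}^{N} \theta(y_n\mathbf{w}^T\mathbf{x}_n)$

$\max_{\mathbf{w}} \ln \prod_{n=1}^{N} \theta(y_n\mathbf{w}^T\mathbf{x}_n)$

$\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^{N} -\ln\theta(y_n\mathbf{w}^T\mathbf{x}_n)$

with $\theta(s) = \frac{1}{1 + \exp(-s)}$:
$\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + \exp(-y_n\mathbf{w}^T\mathbf{x}_n)\right) = \min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^{N} \text{err}(\mathbf{w}, \mathbf{x}_n, y_n) = \min_{\mathbf{w}} E_{\text{in}}(\mathbf{w})$

$\text{err}(\mathbf{w}, \mathbf{x}, y) = \ln\left(1 + \exp(-y\mathbf{w}^T\mathbf{x})\right)$:
cross-entropy error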
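The cross-entropy error and the resulting $E_{\text{in}}$ translate directly into code. The following is a minimal sketch with made-up data; the rewriting of $\ln(1 + e^{z})$ as $\max(z, 0) + \ln(1 + e^{-|z|})$ is a standard numerical-stability trick added here, not something taken from the slides.

```python
import math

def cross_entropy_err(w, x, y):
    """err(w, x, y) = ln(1 + exp(-y w^T x)), computed without overflow."""
    z = -y * sum(wi * xi for wi, xi in zip(w, x))
    # ln(1 + e^z) = max(z, 0) + ln(1 + e^{-|z|})
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def E_in(w, X, y):
    """Average cross-entropy error over the data set."""
    return sum(cross_entropy_err(w, xn, yn) for xn, yn in zip(X, y)) / len(X)

# Hypothetical inputs with an assumed constant feature x_0 = 1.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [+1, -1, -1]
zero_w_error = E_in([0.0, 0.0], X, y)   # every term equals ln 2 when w = 0
```

At $\mathbf{w} = \mathbf{0}$ every score is 0, so each example contributes $\ln 2$, a handy baseline when watching $E_{\text{in}}$ decrease.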
Fun Time

The four statements below help us understand more about the cross-entropy error $\text{err}(\mathbf{w}, \mathbf{x}, y) = \ln\left(1 + \exp(-y\mathbf{w}^T\mathbf{x})\right)$. Consider $\mathbf{w}^T\mathbf{x} \ne 0$. Which statement is not true?

(1) For any $\mathbf{w}$, $\mathbf{x}$, and $y$, $\text{err}(\mathbf{w}, \mathbf{x}, y) > 0$.
(2) For any $\mathbf{w}$, $\mathbf{x}$, and $y$, $\text{err}(\mathbf{w}, \mathbf{x}, y) < 1126$.
(3) When $y = \mathrm{sign}(\mathbf{w}^T\mathbf{x})$, $\text{err}(\mathbf{w}, \mathbf{x}, y) < \ln 2$.
(4) When $y \ne \mathrm{sign}(\mathbf{w}^T\mathbf{x})$, $\text{err}(\mathbf{w}, \mathbf{x}, y) > \ln 2$.

Reference Answer: (2)

1126, really? :-) You are highly encouraged to plot the curve of err with respect to some fixed $y$ and some varying score $s = \mathbf{w}^T\mathbf{x}$ to know more about the error measure. After plotting, it is easy to see that err is not bounded above, and the other three choices are correct.
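In place of a plot, a short table of err against the score $s$ for fixed $y = +1$ already shows the behaviour. This is a sketch only; the particular scores are arbitrary choices for illustration.

```python
import math

def err(s, y):
    """Cross-entropy error as a function of the score s = w^T x."""
    z = -y * s
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Arbitrary illustrative scores, from badly wrong to confidently right for y = +1.
scores = [-1200.0, -2.0, -0.5, 0.5, 2.0, 20.0]
table = [(s, err(s, +1)) for s in scores]
for s, e in table:
    print(f"s = {s:8.1f}   err(s, y=+1) = {e:.6f}")
```

The first row exceeds 1126, the rows with $s < 0$ lie above $\ln 2$, and the rows with $s > 0$ lie below it while staying positive.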
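Gradient of Logistic Regression Error

Minimizing $E_{\text{in}}(\mathbf{w})$

$\min_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + \exp(-y_n\mathbf{w}^T\mathbf{x}_n)\right)$

• $E_{\text{in}}(\mathbf{w})$: continuous, differentiable, twice-differentiable, convex
• how to minimize? locate valley:
  want $\nabla E_{\text{in}}(\mathbf{w}) = \mathbf{0}$

[Figure: $E_{\text{in}}$ plotted against $\mathbf{w}$.]

first: derive $\nabla E_{\text{in}}(\mathbf{w})$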


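The Gradient $\nabla E_{\text{in}}(\mathbf{w})$

$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\Big(\underbrace{1 + \exp(\overbrace{-y_n\mathbf{w}^T\mathbf{x}_n}^{\bigcirc})}_{\square}\Big)$

$\frac{\partial E_{\text{in}}(\mathbf{w})}{\partial w_i} = \frac{1}{N}\sum_{n=1}^{N} \left(\frac{\partial \ln(\square)}{\partial \square}\right)\left(\frac{\partial (1 + \exp(\bigcirc))}{\partial \bigcirc}\right)\left(\frac{\partial (-y_n\mathbf{w}^T\mathbf{x}_n)}{\partial w_i}\right)$
$= \frac{1}{N}\sum_{n=1}^{N} \left(\frac{1}{\square}\right)\left(\exp(\bigcirc)\right)\left(-y_n x_{n,i}\right)$
$= \frac{1}{N}\sum_{n=1}^{N} \left(\frac{\exp(\bigcirc)}{1 + \exp(\bigcirc)}\right)\left(-y_n x_{n,i}\right)$
$= \frac{1}{N}\sum_{n=1}^{N} \theta(\bigcirc)\left(-y_n x_{n,i}\right)$

$\nabla E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \theta\left(-y_n\mathbf{w}^T\mathbf{x}_n\right)\left(-y_n\mathbf{x}_n\right)$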
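One way to gain confidence in the derived gradient is to compare it with a finite-difference estimate. The sketch below does this on made-up data; the helper `numeric_grad` and the step `eps` are illustrative choices, not part of the lecture.

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def E_in(w, X, y):
    return sum(math.log(1.0 + math.exp(-yn * dot(w, xn)))
               for xn, yn in zip(X, y)) / len(X)

def grad_E_in(w, X, y):
    """(1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    g = [0.0] * len(w)
    for xn, yn in zip(X, y):
        weight = theta(-yn * dot(w, xn))
        for i, xni in enumerate(xn):
            g[i] += weight * (-yn * xni)
    return [gi / len(X) for gi in g]

def numeric_grad(w, X, y, eps=1e-6):
    """Central differences, used only to sanity-check grad_E_in."""
    g = []
    for i in range(len(w)):
        up = list(w); up[i] += eps
        dn = list(w); dn[i] -= eps
        g.append((E_in(up, X, y) - E_in(dn, X, y)) / (2 * eps))
    return g

# Hypothetical data and weights.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [+1, -1, -1]
w = [0.3, -0.7]
analytic = grad_E_in(w, X, y)
numeric = numeric_grad(w, X, y)
```

The two vectors should match to several decimal places; a mismatch would usually point to a sign error in the $-y_n\mathbf{x}_n$ factor.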















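Minimizing $E_{\text{in}}(\mathbf{w})$

$\min_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + \exp(-y_n\mathbf{w}^T\mathbf{x}_n)\right)$

want $\nabla E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \theta\left(-y_n\mathbf{w}^T\mathbf{x}_n\right)\left(-y_n\mathbf{x}_n\right) = \mathbf{0}$

scaled $\theta$-weighted sum of $-y_n\mathbf{x}_n$
• all $\theta(\cdot) = 0$: only if $y_n\mathbf{w}^T\mathbf{x}_n \gg 0$ (linear separable $\mathcal{D}$)
• weighted sum $= \mathbf{0}$: non-linear equation of $\mathbf{w}$

closed-form solution? no :-(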
PLA Revisited: Iterative Optimization

PLA: start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and 'correct' its mistakes on $\mathcal{D}$

For $t = 0, 1, \ldots$
(1) find a mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$:
    $\mathrm{sign}\left(\mathbf{w}_t^T\mathbf{x}_{n(t)}\right) \ne y_{n(t)}$
(2) (try to) correct the mistake by
    $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)}\mathbf{x}_{n(t)}$
(equivalently) pick some $n$, and update $\mathbf{w}_t$ by
    $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + [\![\mathrm{sign}(\mathbf{w}_t^T\mathbf{x}_n) \ne y_n]\!]\, y_n\mathbf{x}_n$
    that is, $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \underbrace{1}_{\eta} \cdot \underbrace{\left([\![\mathrm{sign}(\mathbf{w}_t^T\mathbf{x}_n) \ne y_n]\!] \cdot y_n\mathbf{x}_n\right)}_{\mathbf{v}}$
when stop, return last $\mathbf{w}$ as $g$

choice of $(\eta, \mathbf{v})$ and stopping condition defines iterative optimization approach
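Viewed this way, PLA is one instance of the template $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta\mathbf{v}$. The sketch below spells that out; the data, the tie-breaking rule in `sign`, and the "first mistake found" selection are assumptions made for illustration.

```python
def sign(s):
    return 1 if s > 0 else -1        # assumption: a score of exactly 0 counts as -1

def pla(X, y, max_updates=1000):
    """PLA written as w <- w + eta * v with eta = 1."""
    w = [0.0] * len(X[0])
    eta = 1.0
    for _ in range(max_updates):
        mistakes = [n for n in range(len(X))
                    if sign(sum(wi * xi for wi, xi in zip(w, X[n]))) != y[n]]
        if not mistakes:
            break                            # stopping condition: no mistakes left
        n = mistakes[0]                      # choice of n: first mistake found
        v = [y[n] * xi for xi in X[n]]       # v = y_n x_n for a misclassified n
        w = [wi + eta * vi for wi, vi in zip(w, v)]
    return w

# Hypothetical linearly separable data with an assumed constant feature x_0 = 1.
X = [[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -0.5]]
y = [+1, +1, -1, -1]
w_pla = pla(X, y)
```

Swapping in a different rule for `v`, `eta`, or the stopping test gives a different iterative optimization method with the same loop structure.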


Machine Learning Foundations 
Fun Time

Consider the gradient $\nabla E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \theta\left(-y_n\mathbf{w}^T\mathbf{x}_n\right)\left(-y_n\mathbf{x}_n\right)$. That is, each example $(\mathbf{x}_n, y_n)$ contributes to the gradient by an amount of $\theta\left(-y_n\mathbf{w}^T\mathbf{x}_n\right)$. For any given $\mathbf{w}$, which example contributes the most amount to the gradient?

(1) the example with the smallest $y_n\mathbf{w}^T\mathbf{x}_n$ value
(2) the example with the largest $y_n\mathbf{w}^T\mathbf{x}_n$ value
(3) the example with the smallest $\mathbf{w}^T\mathbf{x}_n$ value
(4) the example with the largest $\mathbf{w}^T\mathbf{x}_n$ value

Reference Answer: (1)

Using the fact that $\theta$ is a monotonic function, we see that the example with the smallest $y_n\mathbf{w}^T\mathbf{x}_n$ value contributes to the gradient the most.
Gradient Descent

Iterative Optimization

For $t = 0, 1, \ldots$
    $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta\mathbf{v}$
when stop, return last $\mathbf{w}$ as $g$

• PLA: $\mathbf{v}$ comes from mistake correction
• smooth $E_{\text{in}}(\mathbf{w})$ for logistic regression: choose $\mathbf{v}$ to make the ball roll 'downhill'?
  • direction $\mathbf{v}$: (assumed) of unit length
  • step size $\eta$: (assumed) positive

[Figure: $E_{\text{in}}$ plotted against weights $\mathbf{w}$.]

a greedy approach for some given $\eta > 0$:
$\min_{\|\mathbf{v}\| = 1} E_{\text{in}}(\underbrace{\mathbf{w}_t + \eta\mathbf{v}}_{\mathbf{w}_{t+1}})$
Linear Approximation

a greedy approach for some given $\eta > 0$:
$\min_{\|\mathbf{v}\| = 1} E_{\text{in}}(\mathbf{w}_t + \eta\mathbf{v})$

• still non-linear optimization, now with constraints: not any easier than $\min_{\mathbf{w}} E_{\text{in}}(\mathbf{w})$
• local approximation by linear formula makes problem easier:
  $E_{\text{in}}(\mathbf{w}_t + \eta\mathbf{v}) \approx E_{\text{in}}(\mathbf{w}_t) + \eta\mathbf{v}^T\nabla E_{\text{in}}(\mathbf{w}_t)$
  if $\eta$ really small (Taylor expansion)

an approximate greedy approach for some given small $\eta$:
$\min_{\|\mathbf{v}\| = 1} \underbrace{E_{\text{in}}(\mathbf{w}_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \mathbf{v}^T \underbrace{\nabla E_{\text{in}}(\mathbf{w}_t)}_{\text{known}}$
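Gradient Descent

an approximate greedy approach for some given small $\eta$:
$\min_{\|\mathbf{v}\| = 1} \underbrace{E_{\text{in}}(\mathbf{w}_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \mathbf{v}^T \underbrace{\nabla E_{\text{in}}(\mathbf{w}_t)}_{\text{known}}$

• optimal $\mathbf{v}$: opposite direction of $\nabla E_{\text{in}}(\mathbf{w}_t)$
  $\mathbf{v} = -\frac{\nabla E_{\text{in}}(\mathbf{w}_t)}{\|\nabla E_{\text{in}}(\mathbf{w}_t)\|}$
• gradient descent: for small $\eta$, $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta\frac{\nabla E_{\text{in}}(\mathbf{w}_t)}{\|\nabla E_{\text{in}}(\mathbf{w}_t)\|}$

gradient descent:
a simple & popular optimization tool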
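The claim that the best unit-length $\mathbf{v}$ points opposite to the gradient follows from the Cauchy-Schwarz inequality. A short worked version, assuming $\nabla E_{\text{in}}(\mathbf{w}_t) \ne \mathbf{0}$ so that the normalization is defined:

```latex
% Assumption: \nabla E_{\text{in}}(\mathbf{w}_t) \neq \mathbf{0}
\begin{align*}
\mathbf{v}^T \nabla E_{\text{in}}(\mathbf{w}_t)
  &\ge -\|\mathbf{v}\|\,\|\nabla E_{\text{in}}(\mathbf{w}_t)\|
   = -\|\nabla E_{\text{in}}(\mathbf{w}_t)\|
   && \text{(Cauchy-Schwarz, } \|\mathbf{v}\| = 1\text{)} \\
\text{equality}
  &\iff \mathbf{v} = -\frac{\nabla E_{\text{in}}(\mathbf{w}_t)}{\|\nabla E_{\text{in}}(\mathbf{w}_t)\|} \\
E_{\text{in}}(\mathbf{w}_t + \eta\mathbf{v})
  &\approx E_{\text{in}}(\mathbf{w}_t) - \eta\,\|\nabla E_{\text{in}}(\mathbf{w}_t)\|
   && \text{(at that optimal } \mathbf{v}\text{)}
\end{align*}
```

So, within the linear approximation, each step lowers $E_{\text{in}}$ by about $\eta$ times the gradient norm.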
Choice of $\eta$

[Figure: three plots of in-sample error $E_{\text{in}}$ against weights $\mathbf{w}$. $\eta$ too small: too slow :-( ; $\eta$ too large: too unstable :-( ; $\eta$ just right: use changing $\eta$.]

$\eta$ better be monotonic of $\|\nabla E_{\text{in}}(\mathbf{w}_t)\|$
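The three regimes can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions: it uses the one-dimensional error $E(w) = w^2$ (not the logistic regression error) and arbitrarily chosen step sizes.

```python
def grad(w):
    return 2.0 * w                   # gradient of the toy error E(w) = w**2

def unit_step_descent(eta, w0=1.0, steps=20):
    """w <- w - eta * grad / |grad|: every step has length exactly eta."""
    w = w0
    for _ in range(steps):
        g = grad(w)
        if g == 0.0:
            break
        w = w - eta * g / abs(g)
    return w

def scaled_step_descent(ratio, w0=1.0, steps=20):
    """Step length proportional to |grad|, that is w <- w - ratio * grad."""
    w = w0
    for _ in range(steps):
        w = w - ratio * grad(w)
    return w

too_small = unit_step_descent(0.01)    # creeps from 1.0 towards the minimum at 0
too_large = unit_step_descent(1.5)     # jumps back and forth across the minimum
changing = scaled_step_descent(0.4)    # steps shrink as the gradient shrinks
```

After 20 steps the small fixed step has moved only a little, the large fixed step is still bouncing around the minimum, and the gradient-proportional step is essentially at 0.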


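Simple Heuristic for Changing $\eta$

$\eta$ better be monotonic of $\|\nabla E_{\text{in}}(\mathbf{w}_t)\|$

• if (red) $\eta \propto \|\nabla E_{\text{in}}(\mathbf{w}_t)\|$ by ratio (purple) $\eta$:
  $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_{\text{red}}\frac{\nabla E_{\text{in}}(\mathbf{w}_t)}{\|\nabla E_{\text{in}}(\mathbf{w}_t)\|} = \mathbf{w}_t - \eta_{\text{purple}}\nabla E_{\text{in}}(\mathbf{w}_t)$
• call purple $\eta$ the fixed learning rate

fixed learning rate gradient descent:
$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta\nabla E_{\text{in}}(\mathbf{w}_t)$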





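Putting Everything Together

Logistic Regression Algorithm

initialize $\mathbf{w}_0$
For $t = 0, 1, \ldots$
(1) compute
    $\nabla E_{\text{in}}(\mathbf{w}_t) = \frac{1}{N}\sum_{n=1}^{N} \theta\left(-y_n\mathbf{w}_t^T\mathbf{x}_n\right)\left(-y_n\mathbf{x}_n\right)$
(2) update by
    $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta\nabla E_{\text{in}}(\mathbf{w}_t)$
...until $\nabla E_{\text{in}}(\mathbf{w}_{t+1}) = \mathbf{0}$ or enough iterations
return last $\mathbf{w}_{t+1}$ as $g$

similar time complexity to pocket per iteration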
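The algorithm above can be sketched end to end as follows. This is a minimal illustration, not a tuned implementation: the data set, the learning rate `eta`, the iteration cap, and the tolerance used for "gradient is approximately zero" are all assumptions.

```python
import math

def theta(s):
    # numerically safe logistic function
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def E_in(w, X, y):
    total = 0.0
    for xn, yn in zip(X, y):
        z = -yn * dot(w, xn)
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(X)

def gradient(w, X, y):
    """(1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    g = [0.0] * len(w)
    for xn, yn in zip(X, y):
        weight = theta(-yn * dot(w, xn))
        for i, xni in enumerate(xn):
            g[i] -= weight * yn * xni
    return [gi / len(X) for gi in g]

def logistic_regression(X, y, eta=0.1, max_iters=2000, tol=1e-6):
    w = [0.0] * len(X[0])                      # initialize w_0 = 0
    for _ in range(max_iters):                 # "enough iterations"
        g = gradient(w, X, y)
        if math.sqrt(dot(g, g)) < tol:         # gradient approximately zero
            break
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical noisy 1-D data with an assumed constant feature x_0 = 1.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [-1, -1, +1, -1, +1, +1]
g_w = logistic_regression(X, y)
```

Each iteration makes one pass over the $N$ examples, which is where the per-iteration cost comparable to pocket comes from.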





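Fun Time

If $\mathbf{w}_0 = \mathbf{0}$, and take $\eta = 0.1$. What is $\mathbf{w}_1$ in the logistic regression algorithm?

(1) $+0.1 \cdot \frac{1}{N}\sum_{n=1}^{N} y_n\mathbf{x}_n$
(2) $-0.1 \cdot \frac{1}{N}\sum_{n=1}^{N} y_n\mathbf{x}_n$
(3) $+0.05 \cdot \frac{1}{N}\sum_{n=1}^{N} y_n\mathbf{x}_n$
(4) $-0.05 \cdot \frac{1}{N}\sum_{n=1}^{N} y_n\mathbf{x}_n$

Reference Answer: (3)

You can do a simple substitution using the fact that $\theta(0) = \tfrac{1}{2}$. This result shows that a scaled average of $y_n\mathbf{x}_n$ is somewhat 'one-step' better than the zero vector.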
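The substitution can also be checked numerically. The sketch below takes one gradient-descent step from $\mathbf{w}_0 = \mathbf{0}$ with $\eta = 0.1$ on made-up data and compares it with $0.05 \cdot \frac{1}{N}\sum_n y_n\mathbf{x}_n$.

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical data with an assumed constant feature x_0 = 1.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [+1, -1, -1]
N, d = len(X), len(X[0])
eta = 0.1
w0 = [0.0] * d

# One gradient-descent step from w_0 = 0.
grad = [0.0] * d
for xn, yn in zip(X, y):
    weight = theta(-yn * sum(wi * xi for wi, xi in zip(w0, xn)))  # theta(0) = 1/2
    for i in range(d):
        grad[i] += weight * (-yn * xn[i]) / N
w1 = [w0[i] - eta * grad[i] for i in range(d)]

# Closed form from the quiz: 0.05 * (1/N) * sum_n y_n x_n
expected = [0.05 * sum(yn * xn[i] for xn, yn in zip(X, y)) / N for i in range(d)]
```

The vectors `w1` and `expected` agree up to rounding, matching choice (3).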
Summary
(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?

Lecture 9: Linear Regression
Lecture 10: Logistic Regression
• Logistic Regression Problem
  $P(+1|\mathbf{x})$ as target and $\theta(\mathbf{w}^T\mathbf{x})$ as hypotheses
• Logistic Regression Error
  cross-entropy (negative log likelihood)
• Gradient of Logistic Regression Error
  $\theta$-weighted sum of data vectors
• Gradient Descent
  roll downhill by $-\nabla E_{\text{in}}(\mathbf{w})$

• next: linear models for classification
(4) How Can Machines Learn Better?